Translate, Predict or Generate: Modeling Rich Morphology in Statistical Machine Translation
نویسندگان
چکیده
We compare three methods of modeling morphological features in statistical machine translation (SMT) from English to Arabic, a morphologically rich language. Features can be modeled as part of the core translation process mapping source tokens to target tokens. Alternatively these features can be generated using target monolingual context as part of a separate generation (or post-translation inflection) step. Finally, the features can be predicted using both source and target information in a separate step from translation and generation. We focus on three morphological features that we demonstrate through a manual error analysis to be most problematic for English-Arabic SMT: gender, number and the determiner clitic. Our results show significant improvements over a state-ofthe-art baseline (phrase-based SMT) of almost 1% absolute BLEU on a medium size training set. Our best configuration models the determiner as part of core translation and predicts gender and number separately, and handles the rest of the features through generation.
منابع مشابه
Modeling Inflection and Word-Formation in SMT
The current state-of-the-art in statistical machine translation (SMT) suffers from issues of sparsity and inadequate modeling power when translating into morphologically rich languages. We model both inflection and word-formation for the task of translating into German. We translate from English words to an underspecified German representation and then use linearchain CRFs to predict the fully ...
متن کاملImproving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing
Machine Translation is one of the major oldest and the most active research area in Natural Language Processing. Currently, Statistical Machine Translation (SMT) dominates the Machine Translation research. Statistical Machine Translation is an approach to Machine Translation which uses models to learn translation patterns directly from data, and generalize them to translate a new unseen text. T...
متن کاملHow to overtake Google in MT quality - the Baltic case
Motivation of the language technology company Tilde is to improve quality of machine translation for lesser resourced languages such as the languages of Baltic countries. Generic MT solutions like Google Translate perform poorly for these complex languages. To compensate the shortage of training data and to deal with rich morphology we are applying different approaches in combining statistical ...
متن کاملThe tÜBITAK-UEKAE statistical machine translation system for IWSLT 2009
We describe our Arabic-to-English and Turkish-to-English machine translation systems that participated in the IWSLT 2009 evaluation campaign. Both systems are based on the Moses statistical machine translation toolkit, with added components to address the rich morphology of the source languages. Three different morphological approaches are investigated for Turkish. Our primary submission uses l...
متن کاملJoint Morphological-Lexical Language Modeling for Machine Translation
We present a joint morphological-lexical language model (JMLLM) for use in statistical machine translation (SMT) of language pairs where one or both of the languages are morphologically rich. The proposed JMLLM takes advantage of the rich morphology to reduce the Out-Of-Vocabulary (OOV) rate, while keeping the predictive power of the whole words. It also allows incorporation of additional avail...
متن کامل